Facebook Ads Data Analysis¶
This notebook explores the methodologies and limitations of using the Facebook Graph API to access the Facebook Ad Library, performing a data analysis of the 2021 Dutch general elections to extract key insights into advertisement practices

The notebook is divided into 3 main sections:
- 1. Accessing the Facebook API as General User and Marketer
- 2. Accessing the Facebook API as Developer
- 3. Discovering Key Insights by Analysing Ads Data
The first section discusses how to access and use Facebook ads data from a general user perspective via the Facebook Ad Library service, illustrated with a generic search. The second section shows how to connect directly to the Graph API in developer mode and gather advertisement data for analysis. Finally, the third section presents a pilot data analysis exploring key insights that could be useful for consumer protection investigations or competition law compliance practices
Key Insights Summary¶
In this pilot data analysis, advertisement data was collected on more than 8k ads using Facebook's Ad Library API. The API was searched across all advertisers, restricted to the period around the 2021 Dutch general elections (January to April 2021); all campaigns collected were inactive ones. We observed that ads are shown to consumers for 4 days on average; in particular, the 65+ group was the most microtargeted, with ad impressions running for almost a month. Interestingly, certain ads microtargeted very specific groups and reached a huge audience (more than 1M impressions) in a short amount of time (less than a day); these came from Jonge Socialisten in de PvdA, Wopke Hoekstra (CDA) and Woonbond, targeting 18-24 females, 18-24 males and 65+ males respectively.

The party CDA was by far the one with the most unique ads created, more than 2k, followed by smaller parties like Volt with 409 and DENK with 329. Furthermore, although CDA spent up to 150,000 Euros on campaigns, it was not the one that spent the most: Forum voor Democratie (FVD) spent up to 220,000 Euros on ad campaigns. The least microtargeted regions were Friesland and Zeeland, meaning that there were no campaigns intended exclusively for those regions. Marketers chose the last Friday before the election days (March 12) to launch the majority of the campaigns (3-4 days prior to voting); the ads were generally targeted at all provinces except Zeeland, Groningen and Friesland.

In terms of topics, no special patterns were found; only the mentions of a lockdown and a crisis plan were new. Additionally, using a text similarity metric (Levenshtein distance) between the links and the page names, we did not find suspicious links pointing to dubious websites. Note that some advertisers were mistakenly classified as political by Facebook's algorithms because their ad texts may contain words associated with political issues like "crisis", "environment" or "freedom".
Data acquisition and analysis were done using the Python programming language, with Plotly for visualization and the Natural Language Toolkit for word frequencies. The ads data and code used for the analysis can be found in the Github repo of this work. For any comments, please contact us at p.hernandezserrano@maastrichtuniversity.nl
The Facebook Ad Library is an open repository of all the ads and campaigns, active and inactive, that Facebook runs in many countries. This repository is exposed as a service with a search engine interface that helps users easily look up ads, campaigns, Facebook pages, etc. in the Facebook ads database; this service is naturally connected to the Facebook Graph API. The search interface has two filters: Country and Ad Type.
Limitations:
- The service limits the user to selecting one country at a time
- The service only allows looking up either All ads at once or the Issues, Elections or Politics category. Unfortunately Facebook does not make the rest of its defined categories public; this should be followed up in data-related regulation conversations.
- The service shows the details of each ad, BUT actually only shows the ad identifier, the link to the page, and the ad content. ONLY the Issues, Elections or Politics ads contain information about the demographics of the audiences; the rest of the ads do NOT contain demographics
Interface example:

Once the search is defined, one can browse all the ads related to the keywords entered. Additional filters will appear: Active/Inactive, Advertiser, Platform and Impressions by date.

One can see the details of each ad, but it is important to note that the details only show the ad identifier, the link to the page, and the ad content; information about the demographics of the audience to which the ad was shown is NOT presented by Facebook

Using the API, it is impossible to do the last query, since Facebook only makes the parameter POLITICAL_AND_ISSUE_ADS public; therefore the rest of the ads are not accessible via the API

The following EU ads transparency technical report documents the Facebook Graph API and how it was used for analysing general election ads: https://adtransparency.mozilla.org/eu/methods/
There are a number of uses for the Facebook Ad Library service: typically, marketers will get inspiration from, or measure the impact of, different campaigns worldwide. But a general user can also look back at certain ads seen in the past to get details about the products that were offered. The database is huge, and in principle, given a suspicious Facebook Page id or a particular keyword combination, one could aim to collect evidence of unfair or illegal practices in advertising
The Facebook Graph API is the primary way for apps to read and write to the Facebook social graph. The official documentation is found in the APIs and SDKs docs here. The Graph API has many uses, from creating and publishing a game to analysing friend networks; it naturally contains the ads that Facebook publishes, but only a limited subset of those, restricted to social issues and politics. In order to access and use the API, you need to gain access to the Facebook Ads Library API at https://www.facebook.com/ID and confirm your identity for running (or analysing) Ads About Social Issues, Elections or Politics, which involves receiving a letter with a code at your official address and sending picture identification to Facebook. Basically one has to be registered as an official Facebook developer, and obtaining this permission can take from one day to several weeks.
The Facebook API also has a nice user interface (for developers) called the Graph API Explorer, which allows the developer or analyst to quickly generate access tokens, get code samples of the queries to run, or generate debug information to include in support requests. More info here.
Requirements
- Register as a developer at developers.facebook.com
- Go to the Graph API Explorer and create an app
- With the new app ID, create a Token for the app in the UI
- Define the Graph API node to use: ads_archive
An example query that can be retrieved from the Graph API Explorer is the following
ads_archive?access_token=[TOKEN]
&ad_type=POLITICAL_AND_ISSUE_ADS
&ad_active_status=ALL
&fields=ad_creation_time%2Cad_creative_body%2Cpage_name%2Cdemographic_distribution
&limit=100
&ad_reached_countries=NL
&search_terms=.
There are of course a number of clients that can perform API calls; in the following pilot data analysis we are using a Python implementation
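As a minimal sketch of such a client, the query above can be assembled programmatically; the parameter names mirror the Graph API Explorer example, while the API version (`v12.0`) and the helper name are assumptions for illustration.

```python
# Sketch: building the ads_archive query shown above.
# The API version "v12.0" and this helper are illustrative assumptions.
from urllib.parse import urlencode

GRAPH_URL = "https://graph.facebook.com/v12.0/ads_archive"

def build_ads_archive_query(access_token, search_terms=".",
                            countries="NL", limit=100):
    """Return the full ads_archive URL for the query shown above."""
    params = {
        "access_token": access_token,
        "ad_type": "POLITICAL_AND_ISSUE_ADS",
        "ad_active_status": "ALL",
        "fields": "ad_creation_time,ad_creative_body,"
                  "page_name,demographic_distribution",
        "limit": limit,
        "ad_reached_countries": countries,
        "search_terms": search_terms,
    }
    return GRAPH_URL + "?" + urlencode(params)

# Example (requires a valid token):
# import requests
# ads = requests.get(build_ads_archive_query("EAAB...")).json()["data"]
```

Any HTTP client can then send the request; pagination is handled via the `paging.next` URL that the API returns alongside `data`.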
Interface example:

The following section is divided into Data Collection and Descriptive Statistics in order to understand better the data, finally, we are going to briefly discuss the insights.
Facebook Ads - Data Collection¶
Max Woolf's facebook-ad-library-scraper is the best out-of-the-box solution (as of early 2021) to retrieve ads data from a Python client, since it requires minimal dependencies.
Usage: Configure the script via config.yaml. Go to https://developers.facebook.com/tools/explorer/ to get a User Access Token and fill it in (it normally expires after a few hours, but you can extend it to a 2-month token via the Access Token Debugger). Change other parameters as necessary. Then run the scraper script: to install and run it, the only requirements are the following:
!pip3 install requests tqdm plotly
!python fb_ad_lib_scraper.py
- Outputs: This script outputs three CSV files in an ideal format to be analyzed.
fb_ads.csv: The raw ads and their metadata.
fb_ads_demos.csv: The unnested demographic distributions of people reached by ads, which can be mapped to fb_ads.csv via the ad_id field.
fb_ads_regions.csv: The unnested region distributions of people reached by ads, which can be mapped to fb_ads.csv via the ad_id field.
The following notebook extracts over 8000 inactive ads by querying "stem", thereby filtering for the ones related to the elections, and setting a manageable limit for the API.
The data: The fields include all details about the ads, like creation time, link, description, and caption. Moreover, each ad is associated with a funding entity and a source page; this source page is normally the marketer or the original Facebook page associated with the ad. Finally, there is information on the amount spent, and demographics like gender and regions.
Age groups: 18-24, 25-34, 35-44, 45-54, 55-64, 65+. Gender groups: male, female, unknown. Regions: Noord-Holland, Zuid-Holland, Gelderland, North Brabant, Utrecht, Groningen, Overijssel, Flevoland, Limburg, Friesland, Zeeland.
Extracting and reading the data¶
!python fb_ad_lib_scraper.py
29%|██████████▉ | 8601/30000 [01:26<03:34, 99.56it/s]Traceback (most recent call last):
File "fb_ad_lib_scraper.py", line 61, in <module>
for demo in ad['demographic_distribution']:
KeyError: 'demographic_distribution'
29%|██████████▉ | 8621/30000 [01:26<03:34, 99.78it/s]
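The traceback above shows that the scraper assumes every ad carries a `demographic_distribution` key, which some API responses lack. A defensive sketch of that loop (over hypothetical ad dicts, not the scraper's actual code) would use `dict.get` with a default:

```python
def iter_demographics(ads):
    """Yield (ad_id, demo) rows, skipping ads without demographics.

    Some ads returned by the API lack the 'demographic_distribution'
    key, which is what raised the KeyError above.
    """
    for ad in ads:
        for demo in ad.get("demographic_distribution", []):
            yield ad["id"], demo

# Hypothetical payloads illustrating both cases:
ads = [
    {"id": "1", "demographic_distribution": [
        {"age": "65+", "gender": "male", "percentage": "1.0"}]},
    {"id": "2"},  # no demographics: silently skipped instead of crashing
]
rows = list(iter_demographics(ads))
# rows -> [("1", {"age": "65+", "gender": "male", "percentage": "1.0"})]
```

With this guard the run completes instead of stopping at ad 8621, at the cost of dropping demographic rows for ads that never had them.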
import pandas as pd
df_demographics = pd.read_csv('data/fb_ads_demos.csv')
df_regions = pd.read_csv('data/fb_ads_regions.csv')
df_ads = pd.read_csv('data/fb_ads.csv')
Facebook Ads - Descriptive Statistics¶
The following section focuses on the statistical methodologies for describing the key insights of the Facebook Ads data. It is important to note that no hypothesis testing is performed in the current pilot data analysis, meaning that no correlation or causal effects are studied. The purpose of this section is to understand the key insights of the data and thereby raise questions about ads practices.
Number of unique ads¶
The number of unique ads considered for this pilot data analysis is:
len(df_ads.ad_id.unique())
8621
Ads Impressions Period¶
The ads run following the configuration made by the campaign creator; normally, the campaign runs until the funds are exhausted, and there is a declared min and max budget per ad. Following this logic, some ads can run for hours, others for days. Here we find the average, max, min and count of the display time of each ad
# Converting to datetime
df_ads[['ad_delivery_start_time', 'ad_delivery_stop_time']] =\
df_ads[['ad_delivery_start_time', 'ad_delivery_stop_time']]\
.apply(pd.to_datetime)
# Ads delivery period
df_ads['ad_delivery_start_time'].describe(datetime_is_numeric=True)
count 8619 mean 2021-01-02 23:17:33.811346944 min 2019-09-24 00:00:00 25% 2021-01-21 00:00:00 50% 2021-03-10 00:00:00 75% 2021-03-14 00:00:00 max 2021-07-16 00:00:00 Name: ad_delivery_start_time, dtype: object
The period of the dataset extracted from the API is 2019-09 to 2021-07. This whole period is returned because one can't specify a period in the API call; the date filter has to be done at the data level. For this analysis, we take 2021-01-01 to 2021-05-01.
# Filtering the period and adjusting
df_ads = df_ads\
.query("ad_delivery_start_time >= '2021-01-01' & ad_delivery_start_time < '2021-05-01'")
#unique campaigns
#df_ads.drop_duplicates(subset=['ad_creative_body'], keep='first')
# Creating a new feature, ads time: how long the ads were displayed in days
df_ads['ads_time'] = df_ads['ad_delivery_stop_time'] - df_ads['ad_delivery_start_time']
# Treating them as integers is also useful
df_ads['ads_days_time'] = pd.to_numeric(df_ads['ads_time'].dt.days, downcast='integer')
# Ads general stats
df_ads['ads_time'].describe()
count 6485 mean 4 days 07:32:59.028527370 std 5 days 08:53:42.355982903 min 0 days 00:00:00 25% 1 days 00:00:00 50% 3 days 00:00:00 75% 5 days 00:00:00 max 43 days 00:00:00 Name: ads_time, dtype: object
The campaigns are online for 4 days and 7 hours on average, with a standard deviation of 5 days; the maximum campaign duration is 43 days.
df_ads.sort_values(by='ads_time',ascending=False).head(5)
| ad_id | page_id | page_name | ad_creative_body | ad_creative_link_caption | ad_creative_link_description | ad_creative_link_title | ad_delivery_start_time | ad_delivery_stop_time | funding_entity | impressions_min | spend_min | spend_max | ad_url | currency | ads_time | ads_days_time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6448 | 161990739027528 | 1550088745275913 | DENK | DENK wil dat ondernemers meer steun krijgen ti... | bewegingdenk.nl | CoronapandemieSteun DENK Steunen van het MKB D... | Wij willen ondernemers steunen! | 2021-01-21 | 2021-03-05 | DENK | 15000 | 0 | 99 | https://www.facebook.com/ads/library/?id=16199... | EUR | 43 days | 43 |
| 6453 | 867048690534134 | 1550088745275913 | DENK | DENK wil dat ondernemers meer steun krijgen ti... | www.bewegingdenk.nl | NaN | Wij willen ondernemers steunen! | 2021-01-21 | 2021-03-05 | DENK | 3000 | 0 | 99 | https://www.facebook.com/ads/library/?id=86704... | EUR | 43 days | 43 |
| 6447 | 440021530742598 | 1550088745275913 | DENK | DENK wil dat ondernemers meer steun krijgen ti... | bewegingdenk.nl | CoronapandemieSteun DENK Steunen van het MKB D... | Wij willen ondernemers steunen! | 2021-01-21 | 2021-03-05 | DENK | 20000 | 0 | 99 | https://www.facebook.com/ads/library/?id=44002... | EUR | 43 days | 43 |
| 6456 | 484767479583459 | 1550088745275913 | DENK | DENK wil dat ondernemers meer steun krijgen ti... | www.bewegingdenk.nl | NaN | Wij willen ondernemers steunen! | 2021-01-21 | 2021-03-05 | DENK | 10000 | 0 | 99 | https://www.facebook.com/ads/library/?id=48476... | EUR | 43 days | 43 |
| 6455 | 443594963659555 | 1550088745275913 | DENK | DENK wil dat ondernemers meer steun krijgen ti... | www.bewegingdenk.nl | NaN | Wij willen ondernemers steunen! | 2021-01-21 | 2021-03-05 | DENK | 2000 | 0 | 99 | https://www.facebook.com/ads/library/?id=44359... | EUR | 43 days | 43 |
The longest campaign was the following ad from the DENK party, mentioning "coronacrisis". This ad was online for 43 days.

Here the actual URL to the archive:
https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=NL&q=161990739027528&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all
Authenticity of Links¶
As we could see, the ads normally link to the party page. However, mischievous advertisers and marketers sometimes use ads to promote certain links or websites that lead to a scam page or sometimes viruses. One way to check the authenticity of the links is to compare the link URL to the name of the original page. Following this approach, if the page name is D66 then the associated link is d66.nl or similar; the same holds for non-partisan websites like KiesKlimaat and its link kiesklimaat.nl. Obviously there will be a number of false positives; nevertheless it is a good indication of dubious links.
We use the Levenshtein distance to measure the similarity between the link and the page name: the closer to 0, the more similar the names.
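For illustration, here is a pure-Python sketch of the metric (equivalent to the `lev.distance` call used below, which comes from the python-Levenshtein package):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

levenshtein("D66", "d66.nl")  # 4: change 'D' to 'd', then insert ".nl"
```

Note that a short page name paired with a long caption inflates the distance even for legitimate pairs, which is why the false positives mentioned above appear at the tail of the ranking.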
import Levenshtein as lev
# Get the page name and the link to compare the authenticity
df = df_ads[['ad_creative_link_caption','page_name']].dropna()
df = df[df.ad_creative_link_caption.str.contains('.', regex=False)]  # keep captions containing a literal dot (URL-like)
distances = []
for index, row in df.iterrows():
    distances.append((row['page_name'], row['ad_creative_link_caption'],
                      lev.distance(row['page_name'], row['ad_creative_link_caption'])))
# The closer the distance the more alike the link is to the advertiser name
df_distances = \
pd.DataFrame(distances)\
.rename(columns={0:'page_name',1:'link',2:'levenshtein_distance'})\
.drop_duplicates(subset=['link'], keep='first')\
.sort_values('levenshtein_distance',ascending=True)\
.reset_index(drop=True)
df_distances.tail(10)
| page_name | link | levenshtein_distance | |
|---|---|---|---|
| 266 | Atria, voor gendergelijkheid en vrouwengeschie... | stemgendergelijkheid.nl | 33 |
| 267 | International Campaign for Tibet Europe | savetibet.nl | 33 |
| 268 | Feniks, Emancipatie Expertise Centrum Tilburg | fenikstilburg.nl | 35 |
| 269 | Samen Kerk in Nederland - SKIN | Verkiezingsdebat: internationale kerken & de p... | 40 |
| 270 | Mijn Stem | www.energievannoordoosttwente.nl/wup/weerselo | 41 |
| 271 | CDA | https://www.cda.nl/standpunten/krimpregio | 41 |
| 272 | Mijn Stem | www.energievannoordoosttwente.nl/wup/manderveen | 42 |
| 273 | TivoliVredenburg | Livestream: TivoliVredenburg viert Vrouwendag ... | 43 |
| 274 | Rectoraat San Thomas d'Aquino | Actiegroep; lijn 2 Holtenbroek per direct teru... | 54 |
| 275 | Multicultureel Jongeren Geluid | Het grote (online) Schilderswijkdebat onder le... | 57 |
There are no immediate signs of dubious links in the ads before the Dutch general elections. However, a way to verify this is to export this list and manually fact-check authenticity.
The following example shows how a link that is similar to the page name has a distance close to 0.
df_distances.head(5)
| page_name | link | levenshtein_distance | |
|---|---|---|---|
| 0 | D66 | d66.nl | 4 |
| 1 | BIJ1 | BIJ1.org | 4 |
| 2 | Inwonersbelangen | inwonersbelangen.nl | 4 |
| 3 | Dierenbescherming | dierenbescherming.nl | 4 |
| 4 | ifaw | ifaw.org | 4 |
Amount of Ads Rate Change¶
What are the peaks and rises in the ads shown during the last weeks before the election? It is interesting to note that some advertisers and pages suddenly invested their whole budget way too early, leaving fewer impressions towards the end.
# Group the advertisers by month
feb_march =\
df_ads.groupby(['page_name', df_ads.ad_delivery_start_time.dt.to_period('M').astype(str)])\
.size()\
.loc[slice(None), slice('2021-02', '2021-03')]\
.unstack()
# Top 10 rate of change between the two months
feb_march = feb_march\
.assign(change = (feb_march['2021-03'] - feb_march['2021-02']) / feb_march['2021-03'])\
.sort_values('change')
feb_march.head(10)
| ad_delivery_start_time | 2021-02 | 2021-03 | change |
|---|---|---|---|
| page_name | |||
| ActionAid Nederland | 28.0 | 3.0 | -8.333333 |
| VluchtelingenWerk Nederland | 29.0 | 4.0 | -6.250000 |
| ROSE stories | 7.0 | 1.0 | -6.000000 |
| NLBeter | 32.0 | 5.0 | -5.400000 |
| Nederland KansRijk | 26.0 | 5.0 | -4.200000 |
| D66 Medemblik | 4.0 | 1.0 | -3.000000 |
| Dierenbescherming | 104.0 | 39.0 | -1.666667 |
| BN DeStem | 5.0 | 2.0 | -1.500000 |
| Milieudefensie | 33.0 | 16.0 | -1.062500 |
| Vereniging Basisinkomen | 2.0 | 1.0 | -1.000000 |
feb_march.tail(10)
| ad_delivery_start_time | 2021-02 | 2021-03 | change |
|---|---|---|---|
| page_name | |||
| Wieke Paulusma - kandidaat Tweede Kamerlid D66 | 1.0 | NaN | NaN |
| Windmolens en zonneweide in Baarle Nassau | NaN | 1.0 | NaN |
| Woonbond | NaN | 36.0 | NaN |
| Wopke Hoekstra | NaN | 933.0 | NaN |
| World Animal Protection Nederland | NaN | 6.0 | NaN |
| Wybren van Haga | NaN | 4.0 | NaN |
| Wytske Postma CDA | NaN | 1.0 | NaN |
| de Bergse VVD | NaN | 5.0 | NaN |
| terug naar de Bijbel | NaN | 1.0 | NaN |
| السوريين في هولندا ,مشاكل وحلول | NaN | 1.0 | NaN |
There are some advertisers that are obviously non-political; we won't include them. Facebook mislabels them since their ad text includes keywords related to social issues. It also appears that Wopke Hoekstra went "all in" towards the end.
Number of Ads by Advertiser¶
Who is actually the page/party that is putting more ads on Facebook?
To answer this, we simply count the unique ads for each of the pages; it is interesting to note that certain campaigns of the same party run in parallel.
# plotting library
import plotly.express as px
# Aggregation and counting
df_ads_count = df_ads\
.groupby('page_name')\
.count()['ad_id']\
.reset_index()\
.sort_values(by='ad_id',ascending=False)\
.rename(columns={'ad_id':'Ads Count', 'page_name':'Advertiser'})\
.head(20)
px.bar(df_ads_count.sort_values(by='Ads Count',ascending=True),
x='Ads Count', y='Advertiser', labels=None,
orientation='h', color='Ads Count', color_continuous_scale='blues',title='Number of Ads by Advertiser')
The party CDA was by far the one with the most unique ads created, more than 2k (counting CDA together with the personal Wopke Hoekstra page), followed by smaller parties like Volt with 409 and DENK with 329
Total Spent by Advertiser¶
Who is actually the big spender in this game? The ad details contain the min and max budget per ad; we calculate the median of those two points and accumulate the amount in Euros per page/party.
# For calculation
import numpy as np
# Calculate median of impression and ad spend ranges to get a more realistic estimate
df_ads['spend_median'] = df_ads[['spend_min', 'spend_max']].apply(np.median, axis = 1)
#df_ads['impressions_median'] = df_ads[['impressions_min', 'impressions_max']].apply(np.median, axis = 1)
# Aggregation and counting
df_ads_spend = df_ads.query("currency == 'EUR'")\
.groupby('page_name')\
.sum()['spend_median']\
.reset_index()\
.sort_values(by='spend_median',ascending=False)\
.rename(columns={'spend_median':'EUR Spent', 'page_name':'Advertiser'})\
.head(20)
px.bar(df_ads_spend.sort_values(by='EUR Spent',ascending=True), x='EUR Spent', y='Advertiser', labels=None,
orientation='h', color='EUR Spent', color_continuous_scale='blues', title='Total Spent by Advertiser')
Even though FvD created only 94 unique ads, they were the big spenders, potentially meaning that they put less effort into creation and simply more budget per ad. This could mean that they have a more consistent message (at least judging from the ads).
Demographic Groups Distribution¶
The Facebook Ad Library does not include any information on the actual people who were exposed to the ads, which would nowadays be a real problem for Facebook. Instead, the API provides aggregated statistics on the demographic groups to which each ad was displayed.
For example, ad number 785007302430176 was displayed 60% to men and 40% to women, as well as 80% to the 65+ and 20% to the 18-24 group. We can therefore ask different questions, such as: which groups are exposed the longest?
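The shape of that aggregated field can be illustrated with a small sketch: the payload below is hypothetical (values chosen to mirror the example above, not real API output), and shows how the nested `demographic_distribution` list can be unnested into one row per bucket, the same shape as the `fb_ads_demos.csv` file produced by the scraper.

```python
import pandas as pd

# Hypothetical payload mirroring the API's demographic_distribution field
ad = {
    "id": "785007302430176",
    "demographic_distribution": [
        {"percentage": "0.6", "age": "65+",   "gender": "male"},
        {"percentage": "0.4", "age": "18-24", "gender": "female"},
    ],
}

# Unnest the list of dicts into one row per bucket, keeping the ad id
demos = pd.json_normalize(ad, record_path="demographic_distribution",
                          meta="id", meta_prefix="ad_")
# The API returns percentages as strings; cast them for arithmetic
demos["percentage"] = demos["percentage"].astype(float)
```

Each row carries an `ad_id` column, so the unnested table can be joined back to the ads table, exactly as done with `merge` below.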
# Maximum exposure time of advertisement per group
df_demographics = \
df_demographics\
.merge(df_ads[['ad_id','ads_time','ads_days_time']], on='ad_id', how='left')
df_demographics\
.groupby(['age','gender'])\
.describe()['ads_time']\
.reset_index()\
.sort_values(by='mean',ascending=False)\
.head(10)
| age | gender | count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13-17 | female | 237 | 7 days 10:25:49.367088607 | 6 days 22:39:35.312600885 | 0 days 00:00:00 | 2 days 00:00:00 | 4 days 00:00:00 | 12 days 00:00:00 | 23 days 00:00:00 |
| 1 | 13-17 | male | 239 | 7 days 09:08:17.071129707 | 6 days 23:11:37.949375119 | 0 days 00:00:00 | 2 days 00:00:00 | 4 days 00:00:00 | 12 days 00:00:00 | 23 days 00:00:00 |
| 2 | 13-17 | unknown | 434 | 7 days 01:59:26.820276497 | 7 days 03:45:03.555578597 | 0 days 00:00:00 | 2 days 00:00:00 | 4 days 00:00:00 | 11 days 00:00:00 | 43 days 00:00:00 |
| 21 | Unknown | unknown | 733 | 6 days 14:59:45.266030013 | 6 days 22:36:43.674256390 | 0 days 00:00:00 | 2 days 00:00:00 | 4 days 00:00:00 | 10 days 00:00:00 | 43 days 00:00:00 |
| 4 | 18-24 | male | 6485 | 4 days 07:32:59.028527370 | 5 days 08:53:42.355982903 | 0 days 00:00:00 | 1 days 00:00:00 | 3 days 00:00:00 | 5 days 00:00:00 | 43 days 00:00:00 |
| 5 | 18-24 | unknown | 6485 | 4 days 07:32:59.028527370 | 5 days 08:53:42.355982903 | 0 days 00:00:00 | 1 days 00:00:00 | 3 days 00:00:00 | 5 days 00:00:00 | 43 days 00:00:00 |
| 20 | 65+ | unknown | 6485 | 4 days 07:32:59.028527370 | 5 days 08:53:42.355982903 | 0 days 00:00:00 | 1 days 00:00:00 | 3 days 00:00:00 | 5 days 00:00:00 | 43 days 00:00:00 |
| 19 | 65+ | male | 6485 | 4 days 07:32:59.028527370 | 5 days 08:53:42.355982903 | 0 days 00:00:00 | 1 days 00:00:00 | 3 days 00:00:00 | 5 days 00:00:00 | 43 days 00:00:00 |
| 18 | 65+ | female | 6485 | 4 days 07:32:59.028527370 | 5 days 08:53:42.355982903 | 0 days 00:00:00 | 1 days 00:00:00 | 3 days 00:00:00 | 5 days 00:00:00 | 43 days 00:00:00 |
| 17 | 55-64 | unknown | 6485 | 4 days 07:32:59.028527370 | 5 days 08:53:42.355982903 | 0 days 00:00:00 | 1 days 00:00:00 | 3 days 00:00:00 | 5 days 00:00:00 | 43 days 00:00:00 |
Distributions by Group¶
Moreover, we could explore which groups tend to be more microtargeted. Meaning that a campaign creator would only focus on a particular combination of demographical attributes. For instance "Female 18-24 group".
# Creating a cross-table between gender and age groups per percentage of impresions
pivoted_demographics = df_demographics\
.query("age != 'All (Automated App Ads)' & age != 'Unknown' & gender != 'All (Automated App Ads)' & percentage > 0.3")\
.pivot_table(values='percentage', index=['gender','ad_id'], columns=['age'], aggfunc='max')\
.reset_index()\
.drop(columns=['ad_id'])
pivoted_demographics['gender_code'] = pivoted_demographics['gender']
pivoted_demographics\
.gender_code\
.update(pivoted_demographics.gender_code.map({'unknown':0,'male':1,'female':2}))
fig = px.parallel_coordinates(pivoted_demographics,
color="gender_code",
color_continuous_scale=[(0.00, "white"), (0.33, "white"),
(0.33, "red"), (0.66, "red"), #male
(0.66, "teal"), (1.00, "teal")]) #female
fig.update_layout(coloraxis_colorbar=dict(
    title="Percentage by Group",
    tickvals=[0, 1, 2],  # matches the gender_code mapping above
    ticktext=["Unknown", "Male", "Female"],
    lenmode="pixels", len=150,
))
fig.show()
The only outstanding pattern is a higher tendency to micro-target females: at each peak of the age groups we can see that the female group received 100% of the ads.
px.scatter(df_demographics.query("age != 'All (Automated App Ads)' & age != 'Unknown' & gender != 'All (Automated App Ads)'"),
x="age", y="ads_days_time", color="percentage", color_continuous_scale='RdBu',
category_orders={"age": ["13-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]},
labels=dict(age="Age Group", ads_days_time="Ads Impression in Days", percentage="Percentage"),
size='percentage', hover_data=['gender'],title='Ads Exposure by Age Group')
No matter the age group, the campaigns normally run proportionally. The youngest group, 13-17, still received some ads even though they cannot vote, and the 65+ group is the most microtargeted
Microtargeted Ads¶
Given the above observations, we can further explore the cleverer marketing tricks: microtargeting that reaches audiences in the most optimal way, for instance by targeting one specific demographic group. To gather those, we keep only the ads that were displayed to a single region, gender and age group, quickly exhausting the budget.
ads_one_group = df_demographics\
.query("age !='All (Automated App Ads)' and percentage > 0.9" )\
.dropna()\
.sort_values('ads_days_time', ascending = True)
ads_one_group.head(20)
| ad_id | age | gender | percentage | ads_time | ads_days_time | |
|---|---|---|---|---|---|---|
| 96306 | 430691768045335 | 35-44 | male | 1.000000 | 0 days | 0.0 |
| 104194 | 768477930746491 | 18-24 | female | 1.000000 | 0 days | 0.0 |
| 96414 | 961250341291492 | 65+ | male | 1.000000 | 0 days | 0.0 |
| 38470 | 2672202179737865 | 18-24 | male | 1.000000 | 0 days | 0.0 |
| 38920 | 801162894083104 | 25-34 | female | 1.000000 | 0 days | 0.0 |
| 38956 | 3693120407474936 | 55-64 | male | 1.000000 | 0 days | 0.0 |
| 39244 | 785007302430176 | 55-64 | female | 1.000000 | 0 days | 0.0 |
| 86874 | 813444159242671 | 25-34 | female | 1.000000 | 0 days | 0.0 |
| 88144 | 249667446793436 | 25-34 | female | 1.000000 | 0 days | 0.0 |
| 100660 | 268098691427331 | 35-44 | female | 0.923077 | 0 days | 0.0 |
| 88638 | 335310441243417 | 18-24 | male | 1.000000 | 0 days | 0.0 |
| 89169 | 124698196244795 | 18-24 | female | 1.000000 | 0 days | 0.0 |
| 90617 | 427837775138690 | 65+ | female | 1.000000 | 0 days | 0.0 |
| 96342 | 1254455388284614 | 65+ | male | 1.000000 | 0 days | 0.0 |
| 96378 | 129305299032805 | 45-54 | female | 1.000000 | 0 days | 0.0 |
| 96396 | 898997374229403 | 25-34 | female | 1.000000 | 0 days | 0.0 |
| 27211 | 999569177234876 | 35-44 | female | 1.000000 | 0 days | 0.0 |
| 27193 | 466503074394392 | 45-54 | female | 1.000000 | 0 days | 0.0 |
| 35176 | 819086475347103 | 18-24 | female | 1.000000 | 0 days | 0.0 |
| 27067 | 320354246113048 | 25-34 | female | 1.000000 | 0 days | 0.0 |
Top 3 ads that were specifically targeted at demographic groups, exhausting the budget in a short period of time while reaching a maximum audience.

Here the actual URLs to the archive:
Top ad by Region¶
Similarly to the demographic groups, one can aggregate the ads impressions by Dutch province.
#df_regions[df_regions['ad_id'] == 321114872792279].sort_values('percentage',ascending = False)
ads_one_region = df_regions\
.query("region !='All (Automated App Ads)' & percentage > 0.8" )\
.groupby('region')\
.count()['ad_id']\
.reset_index()\
.sort_values('ad_id')\
.merge(df_regions.groupby('region').count()['ad_id'].reset_index(), on='region', how='left')\
.rename(columns={'ad_id_x':'Targeted Ads', 'ad_id_y':'Total Ads'})
ads_one_region['Relative %'] = ads_one_region['Targeted Ads']/ads_one_region['Total Ads']*100
ads_one_region.head(10)
| region | Targeted Ads | Total Ads | Relative % | |
|---|---|---|---|---|
| 0 | Friesland | 60 | 8621 | 0.695975 |
| 1 | Drenthe | 79 | 4203 | 1.879610 |
| 2 | Zeeland | 91 | 8621 | 1.055562 |
| 3 | Flevoland | 108 | 8621 | 1.252755 |
| 4 | Groningen | 123 | 8621 | 1.426749 |
| 5 | Limburg | 367 | 8621 | 4.257047 |
| 6 | Utrecht | 449 | 8621 | 5.208213 |
| 7 | Overijssel | 462 | 8621 | 5.359007 |
| 8 | Zuid-Holland | 487 | 8621 | 5.648997 |
| 9 | Gelderland | 510 | 8621 | 5.915787 |
px.bar(ads_one_region, x='Relative %', y='region', labels=None,
orientation='h', color='Targeted Ads', color_continuous_scale='blues',title='Most Microtargeted Regions')
The reading is that 8 out of 100 ads shown in Noord-Brabant aren't shown anywhere else. On the other hand, almost none of the ads shown in Friesland were intended to be seen exclusively in that region.
# Grouping the regions and counting the ads over time
regions = ['Noord-Holland','Zuid-Holland','Gelderland','North Brabant',
'Utrecht','Groningen','Overijssel','Flevoland','Limburg','Friesland','Zeeland']
df_regions_date = \
df_regions[(df_regions['region'].isin(regions)) & (df_regions['percentage'] > 0.4)]\
.merge(df_ads, on='ad_id', how='inner')\
.pivot_table(values='ad_id',index='ad_delivery_start_time', columns='region',aggfunc='count',fill_value=None)
fig = px.line(df_regions_date.loc["2021-02-28":"2021-03-20",])
fig.add_vrect(x0="2021-03-15", x1="2021-03-17", row="all", col=1,
annotation_text="Election Days", annotation_position="top left",
fillcolor="green", opacity=0.25, line_width=0)
fig.show()
It is clear that during the election days the campaigns were wound down, and a small handful of them continued after the election for some reason. There is a clear peak on the Friday (March 12) before the election, when the majority of the ads were displayed, right before the election days.
Ads Topics¶
We perform an N-gram analysis on the text of the ads, considering 1-gram to 4-gram terms and filtering with Dutch and English stopword lists; finally, term-frequency and TF-IDF rankings are compared.
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Taking into account only unique campaigns
ads_content = df_ads['ad_creative_body'].unique()
ads_content = [str(i).lower() for i in ads_content]
def clean_text_round(text):
    '''Lowercase the text and blank out bracketed spans, words containing
    digits, HTML tags, escaped whitespace, quotes and separator dashes.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\t', ' ', text)
    text = re.sub(r'\(b\)\(6\)', ' ', text)
    text = re.sub('"', ' ', text)
    text = re.sub('---', ' ', text)
    return text
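As a quick illustration of the digit-word rule in the cleaner, `re.sub(r'\w*\d\w*', ...)` wipes any token containing a digit (the example string here is made up):

```python
import re

# Any token containing a digit is blanked out, e.g. "plan2021" and "17"
text = "stem plan2021 nu 17 maart"
cleaned = re.sub(r'\w*\d\w*', ' ', text)
cleaned = ' '.join(cleaned.split())  # collapse the leftover whitespace
print(cleaned)  # → stem nu maart
```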
stop_words = set(stopwords.words(['english','dutch','turkish']))
# Extra domain-specific stop words: party names, election vocabulary, URL fragments
your_list = ['stem', 'het', 'onze', 'we', 'mij', 'jou', 'nl', 'jouw', 'mee', 'wij',
             'nl ', ' ', 'jij', 'nan', 'per', 'word', 'nederland', 'kamer', 'partij',
             'tweede', 'stemmen', 'den', 'gaat', 'https', 'daarom', 'cda', 'pvda',
             'fvd', 'denk', 'oranje', 'vvd', 'groenlinks', 'sp', 'ga', 'www', 'code',
             'verkiezingen', 'nummer']
for i, line in enumerate(ads_content):
    ads_content[i] = ' '.join([str(x).lower() for x in nltk.word_tokenize(line)
                               if (x not in stop_words) and (x not in your_list)])
# Getting n-grams table comparing raw counts and TF-IDF scores
def ngrams_table(n, list_texts):
    vectorizer = CountVectorizer(ngram_range=(n, n))
    X1 = vectorizer.fit_transform(list_texts)
    # get_feature_names() was removed in newer scikit-learn releases
    features = vectorizer.get_feature_names_out()
    # Applying TF-IDF
    vectorizer = TfidfVectorizer(ngram_range=(n, n))
    X2 = vectorizer.fit_transform(list_texts)
    # Getting top-ranking features
    sums1 = X1.sum(axis=0)
    sums2 = X2.sum(axis=0)
    data = []
    for col, term in enumerate(features):
        data.append((term, sums1[0, col], sums2[0, col]))
    return pd.DataFrame(data, columns=['term', 'rankCount', 'rankTFIDF'])\
        .sort_values('rankCount', ascending=False).reset_index(drop=True)
ads_content = [clean_text_round(text) for text in ads_content]
table_ngrams = pd.DataFrame()
for i in [1, 2, 3, 4]:
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    table_ngrams = pd.concat([table_ngrams, ngrams_table(i, ads_content)])
# drop weird first term
# Note: rankTFIDF holds summed TF-IDF scores, not an "inverse term frequency"
table_ngrams_plot = table_ngrams.iloc[1:, :]\
    .sort_values(by='rankCount')\
    .reset_index(drop=True)\
    .rename(columns={'rankCount': 'Term Frequency', 'rankTFIDF': 'TF-IDF Score', 'term': 'N-grams Keywords'})\
    .sort_values('TF-IDF Score', ascending=False)
px.bar(table_ngrams_plot.head(40).sort_values(by='TF-IDF Score', ascending=True),
       x='N-grams Keywords', y='TF-IDF Score', labels=None,
       orientation='v', color='Term Frequency', color_continuous_scale='RdBu',
       title='Ads Terms Frequency')
The n-gram analysis does not surface striking insights; mostly we see generic campaign terms. What would be interesting to analyse is the consistency between the advertisements and the actual campaign programmes. Moreover, this workflow can be made iterable, so that we could see which topics are shown to each demographic group and region.
Related work and Conclusion¶
There is also interesting work conducting similar analyses. For example, Roberto Rocha from CBC News reports on an analysis of 35,000 political Facebook ads in Canadian elections; his main focus was to discuss how regulation could affect advertising practices. Ondrej Pekacek has also built a comprehensive monitor of ads in Czech elections, aiming at an automated workflow that informs analysts covering the political communication and financing of Czech elections. Example dashboard
There are dozens of other projects focused on the US, which are very interesting but mostly centred on voter-fraud conspiracies, a topic closer to ad scams. Ultimately, we have discussed the limitations of the Facebook Ads Library and piloted a data analysis showing how it can help bring transparency to advertising practices and to democracy.
In this notebook, we focused on the Dutch General Elections of 2021 and found no surprising insights. More important, however, is to open the discussion of whether Facebook should be forced to open its Ads Library to all types of ads, not only the "Social Issues, Elections or Politics" category: advertising is where its business model relies, yet the company is also rapidly losing trust.